library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
library(corrplot)
library(gt)
library(gtsummary)
library(scales)
library(patchwork)
library(ggthemes)
library(RColorBrewer)
library(plotly)
library(leaflet)
library(tidygeocoder)
library(htmlwidgets)
library(flexdashboard)
library(shiny)
library(DT)
library(summarytools)
library(arules)
library(arulesViz)
library(prophet)
library(forecast)
library(data.table)
library(rfm)
This report analyzes sales data from a technology retail store for the period 2019–2020.
Objectives:
Data:
The analysis is based on a CSV file of sales records (order ID, product, quantity, unit price, order date, and purchase address). The data source is a local file with approximately 186,000 rows.
data = read.csv("C:/Users/ibish/Desktop/data/Sales Data.csv")
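The absolute Windows path makes the report machine-specific. A more portable sketch (assuming the CSV sits in a `data/` folder under the project root; `here()` comes from the `here` package) would be:

```r
# Resolve the path relative to the project root instead of hard-coding C:/Users/...
library(here)
data = read.csv(here("data", "Sales Data.csv"))
```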
names(data)
## [1] "X" "Order.ID" "Product" "Quantity.Ordered"
## [5] "Price.Each" "Order.Date" "Purchase.Address"
head(data)
## X Order.ID Product Quantity.Ordered Price.Each Order.Date
## 1 0 295665 Macbook Pro Laptop 1 1700.00 30/12/2019 00:01
## 2 1 295666 LG Washing Machine 1 600.00 29/12/2019 07:03
## 3 2 295667 USB-C Charging Cable 1 11.95 12/12/2019 18:21
## 4 3 295668 27in FHD Monitor 1 149.99 22/12/2019 15:13
## 5 4 295669 USB-C Charging Cable 1 11.95 18/12/2019 12:38
## 6 5 295670 AA Batteries (4-pack) 1 3.84 31/12/2019 22:58
## Purchase.Address
## 1 136 Church St, New York City, NY 10001
## 2 562 2nd St, New York City, NY 10001
## 3 277 Main St, New York City, NY 10001
## 4 410 6th St, San Francisco, CA 94016
## 5 43 Hill St, Atlanta, GA 30301
## 6 200 Jefferson St, New York City, NY 10001
tail(data)
## X Order.ID Product Quantity.Ordered Price.Each
## 185945 13616 222904 Macbook Pro Laptop 1 1700.00
## 185946 13617 222905 AAA Batteries (4-pack) 1 2.99
## 185947 13618 222906 27in FHD Monitor 1 149.99
## 185948 13619 222907 USB-C Charging Cable 1 11.95
## 185949 13620 222908 USB-C Charging Cable 1 11.95
## 185950 13621 222909 AAA Batteries (4-pack) 1 2.99
## Order.Date Purchase.Address
## 185945 09/06/2019 22:07 975 2nd St, Los Angeles, CA 90001
## 185946 07/06/2019 19:02 795 Pine St, Boston, MA 02215
## 185947 01/06/2019 19:29 495 North St, New York City, NY 10001
## 185948 22/06/2019 18:57 319 Ridge St, San Francisco, CA 94016
## 185949 26/06/2019 18:35 916 Main St, San Francisco, CA 94016
## 185950 25/06/2019 14:33 209 11th St, Atlanta, GA 30301
skim(data)
| Name | data |
| Number of rows | 185950 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Product | 0 | 1 | 6 | 26 | 0 | 19 | 0 |
| Order.Date | 0 | 1 | 16 | 16 | 0 | 142395 | 0 |
| Purchase.Address | 0 | 1 | 26 | 42 | 0 | 140787 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1 | 8340.39 | 5450.55 | 0.00 | 3894.00 | 7786.00 | 11872.0 | 25116 | ▇▇▆▂▁ |
| Order.ID | 0 | 1 | 230417.57 | 51512.74 | 141234.00 | 185831.25 | 230367.50 | 275035.8 | 319670 | ▇▇▇▇▇ |
| Quantity.Ordered | 0 | 1 | 1.12 | 0.44 | 1.00 | 1.00 | 1.00 | 1.0 | 9 | ▇▁▁▁▁ |
| Price.Each | 0 | 1 | 184.40 | 332.73 | 2.99 | 11.95 | 14.95 | 150.0 | 1700 | ▇▁▁▁▁ |
Commentary:
The skim output shows no missing values, but order_id values repeat because a single order can span several rows (one per product). The first column (X) holds row numbers carried over from the original file and will be removed, and order_date needs to be converted from text to POSIXct.
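A quick duplicate check (a minimal sketch assuming `data` as loaded above; the exact counts depend on the file) confirms that repeated Order.ID values come from multi-line orders:

```r
# Count how many Order.ID values appear on more than one row
dup_ids <- table(data$Order.ID)
sum(dup_ids > 1)  # number of order IDs spanning several rows

# Inspect one such order: same ID, different products
data[data$Order.ID %in% names(dup_ids[dup_ids > 1])[1], ]
```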
data$X=NULL
data = clean_names(data)
data$order_date = dmy_hm(data$order_date)
data = data.table(data)
data[, state := trimws(sub("^[^,]*,([^,]*),.*$", "\\1", purchase_address))] # extract the city (second comma-separated field of the address)
data[, sales := as.double(quantity_ordered * price_each)]
data[, clean_date := as.Date(order_date)]
Commentary:
All column names were converted to snake_case, order_date was parsed into POSIXct, the city was extracted from the purchase address (stored in the state column), and total sales per line was calculated as quantity × unit price.
n1 = data %>%
  # floor_date collapses e.g. "2019-01-15" to "2019-01-01", so sales aggregate by calendar month
  group_by(month = floor_date(as.Date(order_date), "month")) %>%
  summarise(tot_sales = sum(sales))
ggplotly(ggplot(n1, aes(x = month, y = tot_sales)) +
geom_line(color = "darkblue", size = 1) +
geom_point(color = "darkblue") +
scale_x_date(
breaks = "1 month",
labels = label_date_short(),
expand = c(0.05, 0.05)) +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Total sales by Month", x = "Month", y = "Total Sales"))
Commentary:
From a business analysis perspective, the line chart indicates a clear sales peak in December 2019 driven by seasonal holiday demand. A noticeable increase in sales is also observed in September 2019 compared to the previous month, which can be attributed to the start of the academic season, changes in consumer behavior after the summer, and promotional campaigns by retail chains. The lower value observed in January 2020 is not indicative of a decline in performance and is explained by the fact that the data covers only the first day of the month. Therefore, this observation does not represent a data anomaly but rather a limitation of the reporting period.
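To quantify the December peak and the September uptick, a month-over-month growth rate can be derived from the monthly aggregate (a sketch assuming `n1` as computed above):

```r
library(dplyr)

# Month-over-month growth: (current - previous) / previous
n1_growth <- n1 %>%
  arrange(month) %>%
  mutate(mom_growth = (tot_sales - lag(tot_sales)) / lag(tot_sales))
```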
n2 = data %>% group_by(product) %>% summarise(tot_sales = sum(sales),qnt = sum(quantity_ordered))
n2 = n2 %>% slice_max(tot_sales,n = 10)
ggplotly(ggplot(n2, aes(reorder(product, tot_sales), tot_sales, fill = tot_sales)) +
geom_col() + coord_flip() + scale_fill_gradient(low = "lightblue", high = "darkblue") +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
theme_minimal() + theme(legend.position = "none") + labs(title = "Top 10 Products by Sales", x = "Product", y = "Total Sales"))
Commentary:
Apple products dominate the top-10 list, alongside premium audio brands such as Bose. Monitors are also among the most popular items. Recommendation: focus on these segments by strengthening promotions and ensuring sufficient stock levels.
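Since the aggregate already carries quantities, the revenue leaders can be contrasted with the volume leaders, where cheap accessories (cables, batteries) typically dominate (a sketch reusing the grouping above):

```r
# Top products by units sold rather than revenue
n2_qty <- data %>%
  group_by(product) %>%
  summarise(qnt = sum(quantity_ordered)) %>%
  slice_max(qnt, n = 10)
```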
state_sales = data %>%
group_by(state) %>%
summarise(tot_sales = sum(sales))
state_sales = state_sales %>% geocode(state, method = "osm")
state_sales = data.table(state_sales)
state_sales[, share := percent(round(tot_sales/sum(tot_sales),3))]
state_sales[, popup_text := paste0("<b>", state, "</b><br>",
                                   "Total Sales: ", format(tot_sales, big.mark = " "), "<br>",
                                   "Share of total: ", share)]
leaflet(state_sales) %>% addTiles() %>% addCircles(lng = ~long, lat = ~lat, weight = ~tot_sales/1e5, popup = ~popup_text)
Commentary:
Sales are highly concentrated in San Francisco, Los Angeles, and New York, indicating strong market penetration in these cities. At the same time, other regions underperform.
Recommendation: deploy targeted promotional and marketing initiatives in weaker regions to stimulate demand and reduce dependency on a small number of key markets.
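The concentration claim can be checked numerically from the city-level aggregate (a sketch assuming `state_sales` as computed above; taking the top 3 is an illustrative choice):

```r
# Share of total revenue captured by the three largest markets
top3_share <- state_sales %>%
  slice_max(tot_sales, n = 3) %>%
  summarise(share = sum(tot_sales) / sum(state_sales$tot_sales))
```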
abc = data %>%
group_by(product) %>%
summarise(tot_sales = sum(sales)) %>%
arrange(desc(tot_sales))
abc = abc %>%
mutate(
share = tot_sales / sum(tot_sales), # Individual share
cum_share = cumsum(share), # Cumulative share
abc_class = case_when(
cum_share < 0.8 ~ "A",
cum_share < 0.95 ~ "B",
TRUE ~ "C"))
abc$product <- factor(abc$product, levels = abc$product)
ggplot(abc, aes(x = product)) +
geom_col(aes(y = tot_sales), fill = "steelblue") + # single color for all bars
geom_line(aes(y = cum_share * max(tot_sales), group = 1), color = "red", size = 1) +
geom_point(aes(y = cum_share * max(tot_sales)), color = "red", size = 2) +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()),
name = "Total Sales",
sec.axis = sec_axis(~./max(abc$tot_sales), name = "Cumulative Share", labels = percent)
) +
labs(x = "Product") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Commentary:
The Pareto chart shows that a relatively small group of products accounts for the majority of total sales. The cumulative curve rises quickly, indicating that top-performing items contribute a disproportionate share of revenue, while the remaining products add value more gradually.
Most of the total sales volume is generated before reaching the 80% cumulative threshold, which confirms a strong concentration of revenue around key products. Beyond this point, additional products contribute marginally, suggesting diminishing returns from the long tail of the assortment.
From a business perspective, this implies that sales performance is heavily driven by a limited number of high-impact products. Focusing inventory management, pricing, and promotional efforts on these core items is likely to yield the greatest return, while lower-contributing products should be reviewed for optimization, bundling, or rationalization.
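A compact summary of the classes makes the concentration explicit (a sketch assuming the `abc` table above):

```r
# Number of products and revenue share per ABC class
abc %>%
  group_by(abc_class) %>%
  summarise(n_products = n(),
            revenue_share = sum(share))
```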
analysis_date = as.Date("2020-01-02")
rfm = data %>%
  group_by(purchase_address) %>%
  summarise(recency = as.numeric(analysis_date - max(clean_date)),
            frequency = n(),
            monetary = sum(sales))
rfm = data.table(rfm)
rfm[, `:=`(
r_score = ntile(-recency, 5),
f_score = ntile(frequency, 5),
m_score = ntile(monetary, 5)
)]
rfm[, rfm_score := paste0(r_score, f_score, m_score)]
rfm[, segment := fcase(
r_score >= 4 & f_score >= 4 & m_score >= 4,
"Champions",
r_score >= 3 & f_score >= 4,
"Loyal",
m_score >= 4 & f_score <= 3,
"High Value",
r_score >= 4 & f_score <= 3,
"New / Promising",
r_score <= 2 & f_score >= 3,
"At Risk",
r_score <= 2 & f_score <= 2,
"Lost",
default = "Others"
)]
rfm_tab1 = rfm %>% select(purchase_address, rfm_score, segment)
datatable(rfm_tab1,
extensions = 'Buttons',
options = list(paging = TRUE,
searching = TRUE,
fixedColumns = TRUE,
autoWidth = TRUE,
ordering = TRUE,
dom = 'Bftp',
buttons = c('copy', 'csv', 'excel','print')
))
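Before drilling into the per-customer table, the segment mix can be summarised (a sketch assuming the `rfm` data.table above):

```r
# Customers per RFM segment, largest first
rfm[, .N, by = segment][order(-N)]
```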